The big question we would like to answer with this analysis is: What are the socioeconomic and demographic factors that influence access to mental healthcare providers?
In this notebook, we will try to answer this question by:
In this section, we will:
# Import Libraries
from IPython.display import display
# Import arcgis
import arcgis
from arcgis.gis import GIS
from arcgis.features import FeatureLayer
from arcgis.mapping import WebMap
# Import libraries for data exploration
import pandas as pd
pd.set_option('display.max_columns', 500)
import numpy as np
# Import plotting libraries
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# Import library for time
import time
# Create connection
gis = GIS("https://datascienceqa.esri.com/portal", "portaladmin", "esri.agp", verify_cert=False)
To understand the shortage of healthcare providers, we will try to identify socioeconomic and demographic factors that influence access to these providers. We will use Esri's Living Atlas layers to access socioeconomic, demographic, and health expenditure data, and combine them with provider data for this analysis.
# Search for the data layer
allsearch_result = gis.content.search('title: demographic_healthexp_clean_allproviders')
allsearch_result
# Get the layer
allprovider = allsearch_result[1]
allprovider
allprovider_layer = allprovider.layers[0]
for f in allprovider_layer.properties.fields[:5]:
    print(f['name'], '\t', f['alias'])
# Store data from layer as a spatially enabled dataframe
allprovider_df = allprovider_layer.query(as_df=True)
allprovider_df.shape
# Look at the first few rows of dataframe
allprovider_df.head()
We can see that the dataframe (allprovider_df) has 3139 rows and 41 columns.
Before we plot the data, we will create a copy of allprovider_df and remove the objectid and SHAPE columns.
# Create copy of dataframe
test_newcounty_df = allprovider_df.copy()
# Drop the shape and objectid columns
test_newcounty_df.drop(['objectid','SHAPE'], axis=1, inplace=True)
test_newcounty_df.columns
We will plot each numerical variable with respect to others to see how the data is distributed and how correlated the variables are with each other.
sns.pairplot(test_newcounty_df.iloc[:,0:10])
sns.pairplot(test_newcounty_df.iloc[:,10:20])
sns.pairplot(test_newcounty_df.iloc[:,20:])
From these pair plots we can see that:
A correlation plot is another great way to visualize correlation among predictor variables and the correlation of predictors with the response variable.
f,ax = plt.subplots(figsize=(18, 18))
sns.heatmap(test_newcounty_df.corr(), vmin=-1.0, vmax=1.0, annot=True, linewidths=.5, fmt= '.1f',ax=ax)
From this plot, we can see that:
# Sort the values by descending order of Provider Count
test_newcounty_df = test_newcounty_df.sort_values(by=['provider_count'], ascending=False)
test_newcounty_df.head()
The idea behind a global model is to identify various socioeconomic and demographic factors that influence access to healthcare providers across all counties in the United States.
To build this global model, we will follow a 3 step process:
# Create predictor and response variables
train_x = test_newcounty_df.iloc[:,2:-1]
train_y = test_newcounty_df.iloc[:,-1]
train_x.head()
train_y.head()
This is our first model: an Ordinary Least Squares (OLS) regression model with all predictors, after which we will verify the regression assumptions.
# Import libraries
import statsmodels.api as sm
import statsmodels
from statsmodels.regression import linear_model
# Create Model
X_train = train_x
X_train = sm.add_constant(X_train) # add constant
sm_ols = sm.OLS(train_y, X_train).fit()
# Generate model summary
sm_ols.summary()
# Plot coefficients picked by model along with their importance
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
sm_ols.params[1:].sort_values(ascending=False).plot(kind='barh')
plt.title('Model Coefficients - Base Model')
# Calculate RMSE of model
from statsmodels.tools.eval_measures import rmse
pred_val = sm_ols.fittedvalues.copy()
rmse_base_ols = rmse(train_y, pred_val)
round(rmse_base_ols,2)
Analysing the model summary, we can see that:
- The R-squared value of 0.972 shows that 97.2% of the variability in Provider Count is explained by the model. The R-squared value is too good to be true and the model seems to be overfitting.
- To identify variables that are statistically significant, we look at the p-values of the individual variables. The coefficients that are statistically significant at the 10% significance level are:
- asian_pop, amerind_pop, black_pop, edubase, hisp_pop, median_age, minority_pop, otherace_pop, percap_income, pop_density, white_pop, some_college, asso_deg, grad_deg, total_population, avg_labtest, avg_presdrug, avg_socsecurity.
- The p-value of the F-statistic is less than 0.05, which shows that at least one of the predictor variables has predictive power on the variability of Provider Count.
Here we will verify the regression assumptions of:
# Get residual value
residual = sm_ols.resid
import scipy as sp
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(15,5))
# Residuals vs Fitted
ax1.scatter(pred_val, residual)
ax1.hlines(y=0, xmin=0, xmax=max(pred_val), linewidth=2, color='r')
ax1.set_xlabel('Fitted Values')
ax1.set_ylabel('Residuals')
ax1.set_title('Residuals vs Fitted Values')
# QQ plot
sp.stats.probplot(residual, plot=ax2, fit=True)
# Check Linearity - Plot predictors with response
cols = test_newcounty_df.columns[2:-1]
fig = plt.figure(figsize=(15, 15))
for i in range(27):  # avoid reusing 'sp', which aliases scipy above
    ax = fig.add_subplot(7, 4, i+1)
    ax.scatter(test_newcounty_df.loc[:, cols[i]], test_newcounty_df.iloc[:, -1])
    ax.set_xlabel(cols[i])
    ax.set_ylabel('Provider Count')
    plt.xticks(rotation=0)
    for key, spine in ax.spines.items():
        spine.set_visible(False)
    ax.margins(0.01, 0)
plt.tight_layout()  # automatically adjusts layout to fit long labels
plt.show()
One way to think about whether the results are driven by a given data point is to calculate how far the predicted values would move if the model were fit without that data point. This total distance is called Cook's distance. Cook's D is a function of the leverage and the standardized residual associated with each data point. The influence of each point can be visualized using an Influence Plot.
fig, ax = plt.subplots(figsize=(12,8))
fig = sm.graphics.influence_plot(sm_ols, ax= ax, criterion="cooks")
From this plot, we can see that points toward the right have high leverage and higher-than-average residuals. The large size of points such as 1198 and 2353 indicates that these are influential points. Points such as 916 do not have high leverage but have very high residuals and are considered outliers.
Removing outliers is an iterative process, and outliers need to be investigated before removal. Here, we followed a 4-step process to remove outliers as follows:
The outliers were removed in different iterations. Here are the observations that were removed in each iteration:
- Iteration 1: 1198,2353,2459,1077,1688
- Iteration 2: 2262,1614,916,2710,1846,2784,1601,1405
- Iteration 3: 1512,1727,2204,2613,1178,2632,797,1590,1052
- Iteration 4: 716,209,583,1348,1913,1355,310,1207,388,2423
- Iteration 5: 1659,1698,629,2646,1632,38
The code below shows all outliers being removed.
# Remove Outliers
global_df = test_newcounty_df.copy()
global_df.drop([1198,2353,2459,1077,1688,2262,1614,916,2710,1846,2784,1601,1405,1512,1727,2204,2613,
1178,2632,797,1590,1052,716,209,583,1348,1913,1355,310,1207,388,2423,1659,1698,629,
2646,1632,38], axis=0, inplace=True)
global_df.shape
# Create predictor and response variables
train_x_rerun = global_df.iloc[:,2:-1]
train_y_rerun = global_df.iloc[:,-1]
train_x_rerun.head()
# Run Model
X_train_rerun = train_x_rerun
X_train_rerun = sm.add_constant(X_train_rerun)
global_ols_rerun = sm.OLS(train_y_rerun, X_train_rerun).fit()
residual = global_ols_rerun.resid
# Plot Outliers
fig, ax = plt.subplots(figsize=(12,8))
fig = sm.graphics.influence_plot(global_ols_rerun, ax= ax, criterion="cooks")
# Check Linearity - Plot predictors with response
cols = global_df.columns[2:-1]
fig = plt.figure(figsize=(15, 15))
for i in range(27):
    ax = fig.add_subplot(7, 4, i+1)
    ax.scatter(global_df.loc[:, cols[i]], global_df.iloc[:, -1])
    ax.set_xlabel(cols[i])
    ax.set_ylabel('Provider Count')
    plt.xticks(rotation=0)
    for key, spine in ax.spines.items():
        spine.set_visible(False)
    ax.margins(0.01, 0)
plt.tight_layout()  # automatically adjusts layout to fit long labels
plt.show()
Our regression model results showed only a few predictors that were statistically significant in predicting provider count. Feature selection is used to select the features in the data that contribute most to the response variable. Having irrelevant features in the data can decrease the accuracy of many models, especially linear algorithms. To identify which predictors play a key role and to remove predictors that are not statistically significant, we can use various feature selection techniques.
Greedy algorithms like Stepwise Regression and Recursive Feature Elimination (RFE) work by adding or removing attributes one at a time and building a model on the attributes that remain. The RFE algorithm recursively removes attributes and uses an accuracy metric to rank features according to their importance.
Regularization methods like Lasso and ElasticNet seek to minimize the complexity of the model (the magnitude and number of regression coefficients) by penalizing large coefficients.
LASSO penalizes the sum of the absolute values of the regression coefficients (the L1-norm), thereby forcing many coefficients to exactly 0.
ElasticNet combines the properties of both LASSO and Ridge regression. It penalizes the model using both the L2-norm (sum of squared coefficients) and the L1-norm (sum of absolute coefficients), shrinking some coefficients toward 0 to reduce variance and setting others exactly to 0.
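The zeroing behavior can be seen directly with scikit-learn. A quick sketch on synthetic data (the `alpha` values are chosen only for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

# Synthetic data: only the first of 10 features matters
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)

# The L1 penalty forces most irrelevant coefficients to exactly 0
print('Lasso nonzero:', int(np.sum(lasso.coef_ != 0)))
print('ElasticNet nonzero:', int(np.sum(enet.coef_ != 0)))
```

The surviving nonzero coefficients are the "selected" features; this is the mechanism the pipeline below relies on.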
Feature importance techniques like Random Forest select features using a trained supervised model. A random forest consists of a number of decision trees where each node in a tree is a condition on a single feature, designed to split the dataset.
Here we will:
# Create predictor and response variables
train_x_global = train_x_rerun
train_y_global = train_y_rerun
train_x_global.head()
# Import Libraries
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
# Test options and evaluation metric
num_folds = 10
seed = 7
scoring ='neg_mean_squared_error'
# Create Pipeline and Standardize the dataset
clf = LinearRegression()
pipelines = []
pipelines.append(('ScaledLASSO', Pipeline([('Scaler', StandardScaler()),('LASSO',Lasso(max_iter=10000))])))
pipelines.append(('ScaledEN', Pipeline([('Scaler', StandardScaler()),('EN',ElasticNet())])))
pipelines.append(('ScaledRFECV', Pipeline([('Scaler', StandardScaler()),('RFECV',RFECV(estimator=clf, cv=5))])))
pipelines.append(('ScaledRF', Pipeline([('Scaler', StandardScaler()),('RF',RandomForestRegressor(n_estimators = 100))])))
# Run models
import math
results = []
names = []
for name, model in pipelines:
    # shuffle is required when random_state is set (newer scikit-learn versions)
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
    cv_results = cross_val_score(model, train_x_global, train_y_global, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    # Report cross-validated RMSE (square root of the mean negative MSE) and its std
    print("%s: %f (%f)" % (name, math.sqrt(abs(cv_results.mean())), cv_results.std()))
From the results above, Lasso seems the most promising, with the lowest RMSE. Let's explore Lasso to see which variables are selected by this model.
# Standardize the data using StandardScaler
from sklearn.preprocessing import StandardScaler
sc_data = StandardScaler()
train_x_std = sc_data.fit_transform(train_x_global)
# Run Lasso Model
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
lasso_model=Lasso(max_iter=10000)
lasso_model.fit(train_x_std,train_y_global)
# Identify how many variables are picked
coef = pd.Series(lasso_model.coef_, index=train_x_global.columns)
print("Model picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")
# Get important variables and their coefficients
imp_coef = pd.concat([coef.sort_values(ascending=False).head(10),
coef.sort_values(ascending=False).tail(10)])
imp_coef
# Plot coefficients picked by model along with their importance
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Coefficients in the Model")
From the analysis above, we can see that factors that are predictive of higher provider count are:
Factors that negatively influence provider count are:
In the base model, we saw deviations from linearity, constant variance and independence assumptions. The model also seemed to overfit data with an R-squared value of 0.97.
For Linearity - If a model does not fit well, it does not mean that the regression is not useful. One problem could be that the relation between one or more predictors and the response is not linear.
For Constant Variance and Independence - If normality or the constant variance do not hold, then we transform the response variable. A common transformation is the power transformation of y to the lambda, also called the Box-Cox transformation.
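scipy provides the Box-Cox transformation directly. A minimal sketch on synthetic positive data (the notebook itself uses a log transform, which corresponds to λ = 0 in the Box-Cox family):

```python
import numpy as np
from scipy import stats

# Box-Cox requires strictly positive data
rng = np.random.default_rng(0)
y = rng.lognormal(mean=2.0, sigma=0.8, size=500)

# boxcox estimates the lambda that makes the data most normal
y_trans, lam = stats.boxcox(y)
print(round(lam, 2))  # for lognormal data, lambda should come out near 0 (i.e. log)
```

When the estimated λ is close to 0, taking `np.log` of the response, as we do below, is a reasonable simplification.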
Here we will use the features selected by lasso model, and transform both the predictor and response to create a regression model on our data.
Some transformations, such as log, do not work well with zero or negative values. We do not have any negative values, but let's check for and remove any zero values from the data.
# Create a copy of dataframe
test_log = global_df.copy()
test_log.shape
# Describe to identify any zero values in dataset
test_log.describe()
We can see that the minimum for the Asian population, Black population, and some other variables is 0. Let's find the observations with these zero values and remove them.
# Find observations that have 0 values
state_list = ['asian_pop','black_pop','amerind_pop','hisp_pop','otherace_pop','pop_density','unemp_rate','grad_deg']
x = test_log[(test_log[state_list] == 0).any(axis=1)]
x.shape
We can see that 50 out of 3101 observations have 0 values. We will go ahead and remove these.
Replace 0 values with NaN and then drop those rows.
# Replace 0 with NaN in the columns identified above
for col in state_list:
    test_log[col].replace(0.0, np.nan, inplace=True)
# Drop NaN
test_log.dropna(inplace=True)
test_log.shape
test_log.head()
test_log.describe()
Now that the 0-valued observations are removed, we can see that the minimum value is 1.
# Subset important predictors chosen from Lasso Model
train_x_lasso = test_log.loc[:,imp_coef.index]
train_y_lasso = test_log.iloc[:,-1]
train_x_lasso.head()
# Log-transform the data
train_x_log = np.log(train_x_lasso)
train_y_log = np.log(test_log.iloc[:,-1])
# Distribution of Dependent variable before and after transformation
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(15,5))
sns.distplot(train_y_log, ax=ax2)
sns.distplot(train_y_lasso, ax=ax1)
From the plots above, we can see how the distribution of provider count varies before and after log transformation.
In this model, we will build an Ordinary Least Squares (OLS) regression model using the predictors selected by feature selection.
X_train_log = sm.add_constant(train_x_log)
global_ols = sm.OLS(train_y_log, X_train_log).fit()
global_ols.summary()
Analysing the model summary, we can see that:
- The adjusted R-squared of our global model is 0.949, which means that 94.9% of the variation in provider count can be explained by our model.
- The p-value of the F-statistic is less than 0.05, which shows that at least one of the predictor variables has predictive power on the variability of Provider Count.
- The p-values of individual variables determine which variables are statistically significant. The equation below is the final regression equation for our model showing statistically significant variables and their coefficients. Note that both dependent and independent variables are log transformed.
# Plot coefficients picked by model along with their importance
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
global_ols.params[1:].sort_values(ascending=False).plot(kind='barh')
plt.title('Model Coefficients - Global Model')
Regression Equation
$$
\begin{aligned}
provider\_count ={} & 0.2005*minority\_pop + 0.2312*grad\_deg + 0.7350*total\_population + 0.8371*percap\_income \\
& + 3.8329*avg\_medicare + 0.0806*asso\_deg + 0.0412*otherace\_pop + 0.0544*amerind\_pop + 0.0640*asian\_pop \\
& - 2.1234*avg\_hhsz - 0.9313*median\_age - 4.5444*avg\_presdrug - 0.7842*avg\_socsecurity - 0.1044*black\_pop - 0.1451*hisp\_pop
\end{aligned}
$$
# Model Interpretation
coef = 0.2312
((1.01**coef)-1)*100
Model Interpretations:
- minority_pop: The coefficient is 0.2005. For a 1% increase in minority population, Provider Count will increase by 0.20% ((1.01^coef − 1)*100), holding all other predictors fixed.
- grad_deg: The coefficient is 0.2312. For a 1% increase in graduate degree holders, Provider Count will increase by 0.23%, holding all other predictors fixed.
- total_population: The coefficient is 0.735. For a 1% increase in total population, Provider Count will increase by 0.735%, holding all other predictors fixed.
- percap_income: The coefficient is 0.8371. For a 1% increase in per capita income, Provider Count will increase by 0.84%, holding all other predictors fixed.
- avg_medicare: The coefficient is 3.8329. For a 1% increase in average Medicare, Provider Count will increase by 3.83%, holding all other predictors fixed.
- asso_deg: The coefficient is 0.0806. For a 1% increase in associate degree holders, Provider Count will increase by 0.08%, holding all other predictors fixed.
- otherace_pop: The coefficient is 0.0412. For a 1% increase in other race population, Provider Count will increase by 0.04%, holding all other predictors fixed.
- amerind_pop: The coefficient is 0.0544. For a 1% increase in American Indian population, Provider Count will increase by 0.05%, holding all other predictors fixed.
- asian_pop: The coefficient is 0.0640. For a 1% increase in Asian population, Provider Count will increase by 0.06%, holding all other predictors fixed.
- avg_hhsz: The coefficient is -2.1234. For a 1% increase in average household size, Provider Count will decrease by 2.12%, holding all other predictors fixed.
- median_age: The coefficient is -0.9313. For a 1% increase in median age, Provider Count will decrease by 0.93%, holding all other predictors fixed.
- avg_presdrug: The coefficient is -4.5444. For a 1% increase in average prescription drug spending, Provider Count will decrease by 4.54%, holding all other predictors fixed.
- avg_socsecurity: The coefficient is -0.7842. For a 1% increase in average social security, Provider Count will decrease by 0.784%, holding all other predictors fixed.
- black_pop: The coefficient is -0.1044. For a 1% increase in Black population, Provider Count will decrease by 0.104%, holding all other predictors fixed.
- hisp_pop: The coefficient is -0.1451. For a 1% increase in Hispanic population, Provider Count will decrease by 0.145%, holding all other predictors fixed.
Source for interpretation: https://www.cscu.cornell.edu/news/statnews/stnews83.pdf
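The 1% rule used in these interpretations follows directly from the log-log form of the model:

$$\log(y) = \beta_0 + \beta_1 \log(x_1) + \dots \;\Rightarrow\; \frac{y_{new}}{y_{old}} = \left(\frac{x_{new}}{x_{old}}\right)^{\beta_1} = 1.01^{\beta_1}$$

so a 1% increase in $x_1$ changes $y$ by $(1.01^{\beta_1} - 1) \times 100$ percent. For example, with $\beta = 0.2312$ (grad_deg), $(1.01^{0.2312} - 1) \times 100 \approx 0.23\%$, matching the calculation above.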
# Calculate RMSE of global model
pred_val = global_ols.fittedvalues.copy()
new_rmse = rmse(train_y_log, pred_val)
new_rmse
We can see that the RMSE of our model is 0.38. Since RMSE has the same unit as the dependent variable, we can compare it with the range of the dependent variable to see how spread out our residuals are and how well our model fits. A value of 0.38 on a range of 0-11 (the dependent variable range) is quite small, suggesting the model is a good fit.
# Get confidence intervals
global_ols.conf_int(alpha=0.05, cols=None)
Here we will verify the regression assumptions again.
# Residual Value
residual = global_ols.resid
import scipy as sp
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(15,5))
# Residuals vs Fitted
ax1.scatter(pred_val, residual)
ax1.hlines(y=0, xmin=0, xmax=12, linewidth=2, color='r')
ax1.set_xlabel('Fitted Values')
ax1.set_ylabel('Residuals')
# QQ plot
sp.stats.probplot(residual, plot=ax2, fit=True)
# Residuals vs Predictors
cols = train_x_log.columns
fig = plt.figure(figsize=(15, 15))
for i in range(10):
    mini = min(train_x_log.loc[:, cols[i]])
    maxi = max(train_x_log.loc[:, cols[i]])
    ax = fig.add_subplot(7, 4, i+1)
    ax.scatter(train_x_log.loc[:, cols[i]], residual)
    ax.hlines(y=0, xmin=mini, xmax=maxi, linewidth=2, color='r')
    ax.set_xlabel(cols[i])
    ax.set_ylabel('Residuals')
    plt.xticks(rotation=0)
    for key, spine in ax.spines.items():
        spine.set_visible(False)
    ax.margins(0.01, 0)
plt.tight_layout()  # automatically adjusts layout to fit long labels
plt.show()
The global model helps us identify a relationship between predictor variables and provider count at the country level. However, the dynamics of how different variables impact provider count can be very different for different counties within a state. The question we ask here is:
What if an influential variable's impact on provider count varies across different counties? For example, how does the impact of average household size (or any other variable) on provider count vary across different counties?
To answer this question we will explore Geographically Weighted Regression Model.
Learn more about Geographically Weighted Regression.
To run our analysis, we will use the cleaned demographic and health expenditure data layer demographic_healthexp_clean_allproviders. The output of the model will be stored in a file geodatabase which can then be accessed to visualize the results.
import arcpy
arcpy.stats.GWR(r"C:\Users\mohi9282\Desktop\arcgis\demographic_healthexp.gdb\demographic_healthexp_clean_allproviders",
"provider_count", "CONTINUOUS", "asian_pop;amerind_pop;avg_hhsz;black_pop;hisp_pop;median_age;minority_pop; otherace_pop;percap_income;unemp_rate;white_pop;some_college;grad_deg;avg_healthinsurance; avg_medicare;avg_labtest;avg_presdrug;avg_personalinsurance",
r"C:\Users\mohi9282\Desktop\arcgis\GWR_results\GWR_results.gdb\GWR_result_NB",
"NUMBER_OF_NEIGHBORS", "USER_DEFINED", None, None, None, None, None, None, None, 50,
None, None, None, None, "ROBUST", "GAUSSIAN", None)
# Access GWR data from local
gwr_df = pd.DataFrame.spatial.from_featureclass(r'C:/Users/mohi9282/Desktop/arcgis/GWR_results/GWR_results.gdb/GWR_result_NB')
gwr_df.head()
gwr_df.shape
# Create Map
gwr_avg_hhsz_map = gis.map('USA', 4)
gwr_avg_hhsz_map
The map shows how provider count varies with average household size for different counties. We can see:
- In counties in New York, New Jersey, and around Philadelphia, an increase in average household size will have a higher impact on provider count.
- Similarly, for counties in Washington, Virginia, Maine, New Hampshire, and Massachusetts, an increase in average household size will have a higher impact on provider count.
- In the Midwest, areas near Chicago and Minneapolis also see a higher impact on provider count.
gwr_avg_hhsz_map.remove_layers()
gwr_df.spatial.plot(map_widget=gwr_avg_hhsz_map,
                    renderer_type='c',                  # class breaks renderer
                    method='esriClassifyNaturalBreaks', # classification algorithm
                    class_count=6,                      # number of classes
                    col='C_AVG_HHSZ',                   # numeric column to classify
                    cmap='OrRd',                        # color map to pick colors from for each class
                    alpha=0.7,                          # opacity
                    line_width=0.1)
# Add Legend
gwr_avg_hhsz_map.legend = True
gwr_avg_hhsz_map2 = gis.map('USA', 6)
gwr_avg_hhsz_map2
gwr_avg_hhsz_map2.remove_layers()
gwr_df.spatial.plot(map_widget=gwr_avg_hhsz_map2,
                    renderer_type='c',                  # class breaks renderer
                    method='esriClassifyNaturalBreaks', # classification algorithm
                    class_count=6,                      # number of classes
                    col='SE_AVG_HHSZ',                  # numeric column to classify
                    cmap='OrRd',                        # color map to pick colors from for each class
                    alpha=0.7,                          # opacity
                    line_width=0.01)
gwr_avg_hhsz_map2.legend=True
# Create Map
gwr_avg_presdrug_map = gis.map('USA', 4)
gwr_avg_presdrug_map
The map shows how provider count varies with average prescription drug prices for different counties. We can see:
- Provider count in the majority of California and areas of New Hampshire is strongly negatively impacted by an increase in average prescription drug prices. This means an increase in average prescription drug prices will reduce provider count.
- For areas in Minnesota and Wisconsin, provider count is strongly positively impacted by an increase in average prescription drug prices. This means an increase in average prescription drug prices will increase provider count.
- Although for the majority of the Midwest provider count is positively impacted by average prescription drug prices, some counties in northern New Mexico, southern Colorado, and Kansas seem to show a slight negative impact.
gwr_avg_presdrug_map.remove_layers()
gwr_df.spatial.plot(map_widget=gwr_avg_presdrug_map,
                    renderer_type='c',                  # class breaks renderer
                    method='esriClassifyNaturalBreaks', # classification algorithm
                    class_count=6,                      # number of classes
                    col='C_AVG_PRESDRUG',               # numeric column to classify
                    cmap='OrRd',                        # color map to pick colors from for each class
                    alpha=0.7,                          # opacity
                    line_width=0.05)
gwr_avg_presdrug_map.legend = True
The global model and the GWR model assume a linear relationship between the predictors and the response variable, and identify influential factors at the country level and the county level. However, the question we ask here is:
What if this relationship is not linear?
To answer this question, we will explore the Forest-based Classification and Regression model. Forest-based models are good black-box models that capture non-linear and complex relationships quite well and are, to an extent, not influenced by outliers and missing values.
Forest-based Classification and Regression creates an ensemble of decision trees. Each decision tree is created using a randomly drawn portion of the data. Each tree generates its own prediction and votes on an outcome. The forest model considers the votes from all decision trees to predict or classify the outcome. The model also outputs a variable importance table to identify the importance of each variable given to the model.
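The notebook runs this through the arcpy Forest tool, but the ensemble-of-trees idea and its variable importance output can be sketched with scikit-learn's RandomForestRegressor on synthetic data (illustrative only, not the notebook's model):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: feature 0 is far more informative than the rest
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 4.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

# Each tree sees a bootstrap sample; importances aggregate the split gains
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importances = rf.feature_importances_
print(importances.argmax())  # the dominant feature's index
```

The `feature_importances_` array (normalized to sum to 1) plays the same role as the variable importance table produced by the arcpy tool below.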
Learn more about Forest-based Classification and Regression.
To run our analysis, we will use the cleaned demographic and health expenditure data layer demographic_healthexp_clean_allproviders. The output of the model will be stored in a file geodatabase which can then be accessed to visualize the results.
from arcpy.stats import Forest
arcpy.stats.Forest("TRAIN", r"C:\Users\mohi9282\Desktop\arcgis\demographic_healthexp.gdb\demographic_healthexp_clean_allproviders",
"provider_count", None, "asian_pop false;avg_healthcare false;avg_healthinsurance false;avg_hhinc false;avg_hhsz false;avg_labtest false;avg_medicare false;avg_personalinsurance false;avg_presdrug false;black_pop false;grad_deg false;hisp_pop false;median_age false;some_college false;unemp_rate false;white_pop false",
None, None, None, None, None, None, None, None, None, r"C:\Users\mohi9282\Desktop\arcgis\RF_results\RF_results.gdb\RF_VImpNB_table",
"TRUE", 100, None, None, 100, None, 30, None, r"C:\Users\mohi9282\Desktop\arcgis\RF_results\RF_results.gdb\RF_R2NB_table",
None, 5, "FALSE")
# RF Vimp Table
rf_df = pd.DataFrame.spatial.from_table(r'C:/Users/mohi9282/Desktop/arcgis/RF_results/RF_results.gdb/RF_VImpNB_table')
rf_df.head()
rf_df.shape
sns.set(rc={'figure.figsize':(15,8.27)})
g = sns.boxplot(data = rf_df.iloc[:,:-2])
plt.xticks(rotation=90)
We ran 5 iterations of the Random Forest model. The box plot shows the importance of each variable and how that importance varies across iterations. These variables will now be used in the next step of our analysis to understand how they vary with respect to provider count across counties in the U.S.
# Create dataframe for Best Iteration
best = rf_df[rf_df['BEST_ITE']=='Best Iteration']
best
# Create Plot
sns.set(rc={'figure.figsize':(15,8.27)})
sns.barplot(data=best)
plt.xticks(rotation=90)
plt.title('Variable Importance - Best Iteration', fontsize=16)
The plot shows variable importance as determined by the best iteration of Random Forest model.
The Random Forest model generates variable importance scores and gives us the important variables. However, it does not tell us:
To understand the type and significance of the relationship of a predictor variable with provider count at the county level, we will explore the Local Bivariate Relationships model.
Local Bivariate Relationships analyzes two variables for statistically significant relationships using local entropy. The output can be used to visualize the type of relationship between the two variables and to explore how their relationship changes across counties.
We will use Local Bivariate Relationships to study how the relationships between Provider Count and the important variables determined by the Random Forest model change across counties.
Learn more about Local Bivariate Relationships.
To run our analysis, we will use the cleaned demographic and health expenditure data layer demographic_healthexp_clean_allproviders. The output of the model will be stored in a file geodatabase which can then be accessed to visualize the results.
import arcpy
from arcpy.stats import LocalBivariateRelationships
# Local bivariate
arcpy.stats.LocalBivariateRelationships(r"C:\Users\mohi9282\Desktop\arcgis\demographic_healthexp.gdb\demographic_healthexp_clean_allproviders",
"provider_count", "white_pop", r"C:\Users\mohi9282\Desktop\arcgis\LBR_results.gdb\allprovider_white_pop_LBR",
50, 199, "CREATE_POPUP", "90%", "APPLY_FDR", 0.5)
# arcpy.stats.LocalBivariateRelationships(mental_prov_layer, "provider_count","grad_deg", r"C:\Users\mohi9282\Desktop\arcgis\LBR_results.gdb\mental_grad_deg_LBR",30, 199, "CREATE_POPUP", "90%", "APPLY_FDR", 0.5)
# Access LBR results from local geodatabase
lbr_whitepop = pd.DataFrame.spatial.from_featureclass(r'C:\Users\mohi9282\Desktop\arcgis\LBR_results.gdb\allprovider_white_pop_LBR')
lbr_whitepop.head()
lbr_whitepop.shape
# Counts of different relationship types
lbr_whitepop.LBR_TYPE.value_counts()
# Create Map
lbr_whitepop_map = gis.map('USA', 4)
lbr_whitepop_map
The map shows how the relationship between provider count and white population varies across counties. We can see:
- For most counties in the United States, the relationship is either Convex or Positive Linear, which means that provider count increases with increasing white population either linearly or polynomially.
- Southern parts of New Mexico and some counties in North Texas, Wyoming, and Idaho have a Concave relationship. This means that provider count increases to a certain extent and then plateaus.
lbr_whitepop_map.remove_layers()
lbr_whitepop.spatial.plot(map_widget=lbr_whitepop_map,
                          renderer_type='u',  # unique value renderer
                          col='LBR_TYPE',     # categorical column with the relationship type
                          cmap='Spectral',    # color map to pick colors from for each class
                          alpha=0.7,          # opacity
                          line_width=0.05)
lbr_whitepop_map.legend = True
# Local Bivariate Relationships: provider_count vs. grad_deg
import arcpy
arcpy.stats.LocalBivariateRelationships(r"C:\Users\mohi9282\Desktop\arcgis\demographic_healthexp.gdb\demographic_healthexp_clean_allproviders",
"provider_count", "grad_deg", r"C:\Users\mohi9282\Desktop\arcgis\LBR_results.gdb\allprovider_grad_deg_LBR",
50, 199, "CREATE_POPUP", "90%", "APPLY_FDR", 0.5)
# Read the LBR results from the local file geodatabase
lbr_graddeg = pd.DataFrame.spatial.from_featureclass(r'C:\Users\mohi9282\Desktop\arcgis\LBR_results.gdb\allprovider_grad_deg_LBR')
lbr_graddeg.head()
lbr_graddeg.shape
lbr_graddeg.LBR_TYPE.value_counts()
# Create Map
lbr_graddeg_map = gis.map('USA', 4)
lbr_graddeg_map
The map shows how the relationship between provider count and graduate degree holders varies across counties. We can see:
- The majority of counties from North Dakota down to Texas share a Convex or Positive Linear relationship, which means that provider count increases with the number of graduate degree holders, either linearly or polynomially.
- The majority of counties on the East and West coasts share a Concave relationship, which means that provider count increases with graduate degree holders to a certain extent and then plateaus.
lbr_graddeg_map.remove_layers()
lbr_graddeg.spatial.plot(map_widget=lbr_graddeg_map,
renderer_type='u', # for unique value renderer
col='LBR_TYPE', # categorical column to symbolize
cmap='Spectral', # color map to pick colors from for each class
alpha=0.7, # specify opacity
line_width=0.05) # outline width
lbr_graddeg_map.legend = True
# Local Bivariate Relationships: provider_count vs. some_college
import arcpy
arcpy.stats.LocalBivariateRelationships(r"C:\Users\mohi9282\Desktop\arcgis\demographic_healthexp.gdb\demographic_healthexp_clean_allproviders",
"provider_count", "some_college", r"C:\Users\mohi9282\Desktop\arcgis\LBR_results.gdb\allprovider_some_college_LBR",
50, 199, "CREATE_POPUP", "90%", "APPLY_FDR", 0.5)
# Read the LBR results from the local file geodatabase
lbr_somecollege = pd.DataFrame.spatial.from_featureclass(r'C:\Users\mohi9282\Desktop\arcgis\LBR_results.gdb\allprovider_some_college_LBR')
lbr_somecollege.head()
lbr_somecollege.shape
lbr_somecollege.LBR_TYPE.value_counts()
# Create Map
lbr_somecollege_map = gis.map('USA', 4)
lbr_somecollege_map
The map shows how the relationship between provider count and the some-college factor varies across counties. We can see:
- While the majority of counties on the West Coast share a Convex relationship, counties in Northern and Central California have a Positive Linear relationship. This means that provider count increases with the number of some-college degree holders, either linearly or polynomially.
- Counties in Montana, Wyoming, Colorado, and New Mexico share a Positive Linear relationship, though a patch of counties in Northern New Mexico and Southern Colorado has a Convex relationship.
- While the majority of counties on the East Coast share a Positive Linear relationship, a group of counties in Eastern Virginia has a Concave relationship between provider count and the some-college factor.
lbr_somecollege_map.remove_layers()
lbr_somecollege.spatial.plot(map_widget=lbr_somecollege_map,
renderer_type='u', # for unique value renderer
col='LBR_TYPE', # categorical column to symbolize
cmap='Spectral', # color map to pick colors from for each class
alpha=0.7, # specify opacity
line_width=0.05) # outline width
lbr_somecollege_map.legend = True
# Local Bivariate Relationships: provider_count vs. asian_pop
import arcpy
arcpy.stats.LocalBivariateRelationships(r"C:\Users\mohi9282\Desktop\arcgis\demographic_healthexp.gdb\demographic_healthexp_clean_allproviders",
"provider_count", "asian_pop", r"C:\Users\mohi9282\Desktop\arcgis\LBR_results.gdb\allprovider_asian_pop_LBR",
50, 199, "CREATE_POPUP", "90%", "APPLY_FDR", 0.5)
# Read the LBR results from the local file geodatabase
lbr_asianpop = pd.DataFrame.spatial.from_featureclass(r'C:\Users\mohi9282\Desktop\arcgis\LBR_results.gdb\allprovider_asian_pop_LBR')
lbr_asianpop.head()
lbr_asianpop.shape
lbr_asianpop.LBR_TYPE.value_counts()
# Create Map
lbr_asianpop_map = gis.map('USA', 4)
lbr_asianpop_map
The map shows how the relationship between provider count and Asian population varies across counties. We can see:
- The majority of counties in the U.S. share a Concave relationship, which means that provider count increases with Asian population to a certain extent and then plateaus.
- Counties in Southern California, Arizona, the North Central U.S., and Texas share a Convex or Positive Linear relationship, which means that provider count increases with increasing Asian population, either linearly or polynomially.
lbr_asianpop_map.remove_layers()
lbr_asianpop.spatial.plot(map_widget=lbr_asianpop_map,
renderer_type='u', # for unique value renderer
col='LBR_TYPE', # categorical column to symbolize
cmap='Spectral', # color map to pick colors from for each class
alpha=0.7, # specify opacity
line_width=0.05) # outline width
lbr_asianpop_map.legend = True
# Local Bivariate Relationships: provider_count vs. hisp_pop
import arcpy
arcpy.stats.LocalBivariateRelationships(r"C:\Users\mohi9282\Desktop\arcgis\demographic_healthexp.gdb\demographic_healthexp_clean_allproviders",
"provider_count", "hisp_pop", r"C:\Users\mohi9282\Desktop\arcgis\LBR_results.gdb\allprovider_hisp_pop_LBR",
50, 199, "CREATE_POPUP", "90%", "APPLY_FDR", 0.5)
# Read the LBR results from the local file geodatabase
lbr_hisppop = pd.DataFrame.spatial.from_featureclass(r'C:\Users\mohi9282\Desktop\arcgis\LBR_results.gdb\allprovider_hisp_pop_LBR')
lbr_hisppop.head()
lbr_hisppop.shape
lbr_hisppop.LBR_TYPE.value_counts()
# Create Map
lbr_hisppop_map = gis.map('USA', 4)
lbr_hisppop_map
The map shows how the relationship between provider count and Hispanic population varies across counties. We can see:
- The majority of counties in Arizona, California, and Oregon share a Positive Linear relationship, which means that provider count increases with increasing Hispanic population.
- Counties in Washington share a Convex relationship, which means that provider count increases polynomially with increasing Hispanic population.
- Counties in the Midwest share a Concave relationship, which means that provider count increases with Hispanic population to a certain extent and then plateaus.
lbr_hisppop_map.remove_layers()
lbr_hisppop.spatial.plot(map_widget=lbr_hisppop_map,
renderer_type='u', # for unique value renderer
col='LBR_TYPE', # categorical column to symbolize
cmap='Spectral', # color map to pick colors from for each class
alpha=0.7, # specify opacity
line_width=0.05) # outline width
lbr_hisppop_map.legend = True
# Local Bivariate Relationships: provider_count vs. black_pop
import arcpy
arcpy.stats.LocalBivariateRelationships(r"C:\Users\mohi9282\Desktop\arcgis\demographic_healthexp.gdb\demographic_healthexp_clean_allproviders",
"provider_count", "black_pop", r"C:\Users\mohi9282\Desktop\arcgis\LBR_results.gdb\allprovider_black_pop_LBR",
50, 199, "CREATE_POPUP", "90%", "APPLY_FDR", 0.5)
# Read the LBR results from the local file geodatabase
lbr_blackpop = pd.DataFrame.spatial.from_featureclass(r'C:\Users\mohi9282\Desktop\arcgis\LBR_results.gdb\allprovider_black_pop_LBR')
lbr_blackpop.head()
lbr_blackpop.shape
lbr_blackpop.LBR_TYPE.value_counts()
# Create Map
lbr_blackpop_map = gis.map('USA', 4)
lbr_blackpop_map
The map shows how the relationship between provider count and Black population varies across counties. We can see:
- The majority of counties in the U.S. share a Concave relationship, which means that provider count increases with Black population to a certain extent and then plateaus.
- The majority of counties on the West Coast share a Positive Linear relationship, which means that provider count increases with increasing Black population.
lbr_blackpop_map.remove_layers()
lbr_blackpop.spatial.plot(map_widget=lbr_blackpop_map,
renderer_type='u', # for unique value renderer
col='LBR_TYPE', # categorical column to symbolize
cmap='Spectral', # color map to pick colors from for each class
alpha=0.7, # specify opacity
line_width=0.05) # outline width
lbr_blackpop_map.legend = True
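Having run the tool for all six variables, the per-variable LBR_TYPE counts can be placed side by side for comparison. A sketch (the helper name `compare_lbr_types` is ours, and the demo series below are illustrative, not actual results):

```python
import pandas as pd

def compare_lbr_types(results: dict) -> pd.DataFrame:
    """Combine LBR_TYPE counts from several runs into one table:
    one row per relationship type, one column per explanatory variable."""
    table = pd.DataFrame({var: s.value_counts() for var, s in results.items()})
    return table.fillna(0).astype(int)

# Illustrative series standing in for the lbr_*.LBR_TYPE columns:
demo = {
    "white_pop": pd.Series(["Convex", "Positive Linear", "Convex"]),
    "black_pop": pd.Series(["Concave", "Concave", "Positive Linear"]),
}
print(compare_lbr_types(demo))
```

On the real outputs this would be called with, e.g., `{"white_pop": lbr_whitepop.LBR_TYPE, "black_pop": lbr_blackpop.LBR_TYPE, ...}`.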